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Abstract. Reticulate events play an important role in determining evo- 
lutionary relationships. The problem of computing the minimum number 
of such events to explain discordance between two phylogenetic trees is a 
hard computational problem. In practice, exact solvers struggle to solve 
instances with reticulation number larger than 40. For such instances, one 
has to resort to heuristics and approximation algorithms. Here we present 
the algorithm CycleKiller which is the first approximation algorithm 
that can produce solutions verifiably close to optimality for instances 
with hundreds or even thousands of reticulations. Theoretically, the al- 
gorithm is an exponential-time 2-approximation (or 4-approximation in 
its fastest mode). However, using simulations we demonstrate that in 
practice the algorithm runs quickly for large and difficult instances, pro- 
ducing solutions within one percent of optimality. An implementation 
of this algorithm, which extends the theoretical work of [14], has been 
made publicly available. 



1 Introduction 
1.1 Background 

A phylogenetic tree is a model used in biology to represent the evolution- 
ary history of a set X of species (or taxa) [9|10| . They are trees whose 
leaves are bijectively labeled by X and whose internal vertices represent 
the ancestors of the species set; they can be rooted or unrooted. Since in 
a rooted tree edges have a direction, the concepts of indegree and outde- 
gree of a vertex are well defined. Binary rooted (phylogenetic) trees are 
rooted (phylogenetic) trees whose internal vertices have outdegree 2. 

Some biological events such as hybridization, recombination and hor- 
izontal gene transfer cannot be modeled by a tree. To represent an event 



in which a species derives its genes from different ancestors, trees are 
generalized to allow vertices with indegree two or higher, known as retic- 
ulations. This model is called a rooted phylogenetic network. For detailed 
background information we refer the reader to |11|13|15] . Note that there 
also exist models based on unrooted phylogenetic networks but these are 
not the subject of this article. 

Although phylogenetic networks are more general than phylogenetic 
trees, trees are still often the basic building blocks from which they are 
constructed. Specifically, there are many techniques available for con- 
structing gene trees. However, when more genes are analyzed, topological 
conflicts between individual gene phylogenies can arise for methodologi- 
cal or biological reasons (i.e. aforementioned reticulate phenomena such 
as hybridization). This led computational biologists to try and quantify 
the amount of reticulation that is needed to simultaneously explain two 
trees. 

To state this problem more formally, we say that a phylogenetic net- 
work N on X displays a phylogenetic tree T on X if T can be obtained 
from a subtree of iV by contracting edges. (Informally this means that T 
can be obtained from N by, for each reticulation vertex of N, "switching 
off" all but one of its incoming edges and then suppressing all indegeee-1 
outdegree-1 vertices). Given two rooted binary phylogenetic trees T\ and 
T2 on X , the problem then becomes to determine the minimum number of 
reticulations contained in a phylogenetic network N on X displaying both 
trees. The value we are minimizing is often called the hybridization num- 
ber and instead of the term phylogenetic network, the term hybridization 
network is often used. 

It is known that the problem of computing hybridization numbers is 
both NP-hard and APX-hard [4], but it is not known whether it is in 
APX (i.e. whether it admits a polynomial-time approximation algorithm 
that achieves a constant approximation ratio). 

Until recently, most research on the hybridization number of two bi- 
nary trees had focused on the question of how to exactly compute this 
value using fixed parameter tractable (FPT) algorithms. For an introduc- 
tion to FPT we refer to [8]. Algorithmic progress has been considerable in 
this area, with various authors reporting increasingly sophisticated FPT 
algorithms where the parameter in question is the hybridization number r 
of the two trees |3|5|7|18j . The fastest algorithms currently implemented 
are the algorithm available inside the package Dendroscope [12], based 
on pQ, and the algorithm HybpjdNet [5]. The fastest theoretical FPT 



algorithm has running time 0(3.18 r n) [TS], where n is the number of taxa 
in the trees. 

Such FPT algorithms do, however, have their limits. For NP-hard 
problems such as hybridization number, and under the assumption that 
P ^ NP, the running time still grows exponentially in r, albeit usually 
at a slower rate than algorithms that have a running time of the form 
r?/( r ) where / is some function of r. In practice this means that existing 
algorithms struggle to terminate for instances with hybridization number 
larger than 40. 

For such large and/or difficult trees one can try to generate heuristic 
or approximate solutions, but how far are such solutions from optimal- 
ly? In |14] we showed that the news is worrying. Indeed, we showed 
that polynomial-time constant-ratio approximation algorithms exist if 
and only if such algorithms exist for the problem Directed Feedback Ver- 
tex Set (DFVS). However, DFVS is one of the most well-studied problems 
in combinatorial optimization and to this day it is unknown if it permits 
such an algorithm. Pending a major breakthrough in computer science, it 
therefore seems difficult to build efficient algorithms which approximate 
hybridization number well. On the positive side, we showed that in poly- 
nomial time an algorithm with approximation ratio 0(log r log log r) is 
possible. However, this algorithm is purely of theoretical interest and is 
not useful in practice. 

1.2 A new algorithm: CycleKiller 

In this article we extend the theoretical work of [H] slightly and give 
it a practical twist to yield a fast approximation algorithm (which we 
have made publicly available as the program CycleKiller) that achieves 
a low constant ratio approximation, even for massive trees where the 
hybridization number runs into hundreds or even thousands. 

The worst-case running time of this approximation algorithm is expo- 
nential. However, as we demonstrate with experiments, the running time 
of the algorithm is in practice extremely fast and, when the hybridiza- 
tion number is large, typically orders of magnitude faster than Hybrid- 
Net or the algorithm in Dendroscope. Of course, those algorithms at- 
tempt to compute optimum solutions, whereas our algorithm only gives 
approximate solutions. Nevertheless, our experiments show that when 
CycleKiller is run in its most accurate mode of operation an approxi- 
mation ratio very close to 1 is not unusual, suggesting that the algorithm 
often produces solutions close to optimality and well within the worst-case 
approximation guarantee. 



Specifically, we describe an algorithm with approximation ratio c[d + 1) 
for the hybridization number problem by combining a c-approximation 
for the problem MAF (Maximum Agreement Forest) (see e.g. [16 18J) 
with a d- approximation for the problem DFVS. Both these problems are 
NP-hard so polynomial-time algorithms attaining c = 1 or d = 1 are 
not realistic. Nevertheless, there exist extremely fast FPT algorithms for 
solving MAF exactly (i.e. c = 1), the fastest is rSPR by Whidden, Beiko 
and Zeh |17|19j . Moreover, we observe that the type of DFVS instances 
that arise in practice can easily be solved using Integer Linear Program- 
ming (ILP) (and freely-available ILP solver technology such as GLPK), so 
d = 1 is also often possible. Combining these two exact approaches gives 
us an exponential-time approximation algorithm with worst-case approx- 
imation ratio 2 that for large instances still runs extremely quickly; this is 
the 2-approx option of CycleKiller. In practice, we have observed that 
the upper bound of 2 is often pessimistic, with much better approxima- 
tion ratios observed in experiments (1.003 for the simulations presented 
in this article). We find that this algorithm already allows us to cope with 
much bigger trees than HybridNet or the algorithm in Dendroscope. 

Nevertheless, for truly massive trees it is often not feasible to have 
c = 1. Fortunately there exist linear-time algorithms which achieve c = 3 
[18J . This, coupled with the fact that (even for such trees) it remains fea- 
sible to use an exact (d = 1) solver for DFVS, means that in practice we 
achieve a 4-approximation for gigantic trees; this is the 4-approx option of 
CycleKiller. Again, the ratio of 4 is a worst-case bound and we suspect 
that in practice we are doing much better than 4. However, this cannot be 
experimentally verified due to the lack of good lower bounds for such mas- 
sive instances. In any case, the main advantage of the 4-approximation is 
that it can without too much effort cope with trees with hundreds or thou- 
sands of taxa and hybridization number of a similar order of magnitude. 
An implementation of CycleKiller and accompanying documentation 
can be downloaded from http : //skelk . sdf -eu . org/ cyclekillerj Net- 
works created by the algorithm can be viewed in Dendroscope. 

1.3 Theoretical and practical significance 

We have described, implemented and made publicly available an algo- 
rithm with two desirable qualities: it terminates quickly even for massive 
instances of hybridization number and it gives a non-trivial guarantee of 
proximity to optimality. This is the first algorithm with such properties. 
The algorithm is a non-trivial marriage of MAF and DFVS solvers (both 



exact and approximate), meaning that further advances in solving MAF 
and DFVS will directly lead to improvements in CycleKiller. 

This article also improves the theoretical work given in [T?j , which also 
proposed using DFVS but beginning from a trivial Agreement Forest (AF) 
known as a chain forest. Here we use a more intelligent starting point: an 
(approximate) MAF, and it is this insight which makes a 2-approximation 
(rather than the 6-approximation implied by p3] ) possible when using an 
exact DFVS solver. Other articles have also had the idea of cycle-breaking 
in AFs: the advanced FPT algorithm of Whidden et al |18j - that, as far 
as we know, has not been implemented - and the algorithm HybridNet 
by Chen et al. [5] (also see [6]). However, both algorithms start the cycle- 
breaking from many starting points. In contrast, our algorithm requires 
only a single starting point, i.e. an (approximate) solution to MAF. 

2 Preliminaries 

Let X be a finite set (e.g. of species). A rooted phylogenetic X-tree is 
a rooted tree with no vertices with indegree 1 and outdegree 1, a root 
with indegree and outdegree at least 2, and leaves bijectively labelled 
by the elements of X. We identify each leaf with its label and use L(T) 
to refer to the leaf set (or label set) of T. A rooted phylogenetic X-tree 
is called binary if each nonleaf vertex has outdegree two. We henceforth 
call a rooted, binary phylogenetic X-tree a tree for short. For a tree T 
and a set X' C X, we use the notation T(X') to denote the minimal 
subtree of T that contains all elements of X' and T\X' denotes the result 
of suppressing all indegeee-1 outdegree-1 vertices in T(X'). 

We define a forest as a set of trees. Each element of a forest is called a 
component. Let T be a tree and T a forest. We say that J 7 is a forest for T 
if T\L(F) is isomorphic to F for all F £ T and the trees {T(L(F)), F € 
F} are vertex-disjoint subtrees of T whose leaf-set union equals L(T). 
If T\ and Ti are two trees, then a forest T is an agreement forest of T\ 
and T2 if it is a forest for T\ and T2. The number of components of JF is 
denoted 

We define cleaning up a directed graph as repeatedly suppressing 
indegree- 1 outdegree-1 vertices, removing indegree-0 outdegree-1 vertices 
and removing unlabelled outdegree-0 vertices until no such operation is 
possible. Observe that, if J 7 is a forest for T, T can be obtained from T 
by removing | T\ — 1 edges and cleaning up. 

Problem: Maximum Agreement Forest (MAF) 

Instance: Two rooted, binary phylogenetic trees T\ and T2. 



Solution: An agreement forest F of T\ and T 2 . 
Objective: Minimize \F\ — 1. 

The directed graph IG(Ti,T 2 ,F), called the inheritance graph, is the 
directed graph whose vertices are the components of F and which has an 
edge (F, F') precisely if either 

• there is a directed path in T\ from the root of T\(L(F)) to the root 
of Ti{L(F')) or; 

• there is a directed path in T2 from the root of T 2 (L(F)) to the root 
of T 2 (L(F')). 

An agreement forest T of T\ and T 2 is called an acyclic agreement forest 
if the graph IG(T\, T 2 , F) is acyclic. A maximum acyclic agreement forest 
(MAAF) of F\ and T2 is an acyclic agreement forest of T\ and T 2 with a 
minimum number of components. 

Problem: Maximum Acyclic Agreement Forest (MAAF) 
Instance: Two rooted, binary phylogenetic trees T\ and T 2 . 
Solution: An acyclic agreement forest F of F\ and T 2 . 
Objective: Minimize \F\ — 1. 

We use MAF(Ti,T 2 ) and MAAF(T 1) T 2 ) to denote the optimal so- 
lution value of the problem MAF and MAAF respectively, for an in- 
stance Ti,T 2 . 

A rooted phylogenetic network on X is a directed acyclic graph with no 
vertices with indegree 1 and outdegree 1 and leaves bijectively labelled by 
the elements of X. Rooted phylogenetic networks, which are sometimes 
also called hybridization networks, will henceforth be called networks for 
short in this paper. A tree T on X is displayed by a network N if T can 
be obtained from a subtree of N by contracting edges. A reticulation is 
a vertex v with 8~(v) > 2 (with S~(v) denoting the indegree of v). The 
reticulation number (sometimes also called hybridization number) of a 
network N with root p is given by 

r(A0=J»)-l). 

It was shown that the optimum to MAAF is equal to the optimum of the 
following problem [2]. 

Problem: MinimumHybridization 

Instance: Two rooted, binary phylogenetic trees T\ and T 2 . 



Solution: A rooted phylogenetic network N that displays T\ and T 2 . 
Objective: Minimize r(N). 

Moreover, it was shown that, for two trees Ti,T 2 , any acyclic agree- 
ment forest for T\ and T 2 with k + 1 components can be turned into a 
phylogenetic network that displays T\ and T 2 and has reticulation num- 
ber k, and vice versa. Thus, any approximation for MAAF gives an ap- 
proximation for MinimumHybridization. 

A feedback vertex set of a directed graph is a subset of the vertices that 
contains at least one vertex of each directed cycle. Equivalently, a subset 
of the vertices of a directed graph is a feedback vertex set if removing 
these vertices from the graph makes it acyclic. 

Problem: Directed Feedback Vertex Set (DFVS) 
Instance: A directed graph D. 

Goal: Find a feedback vertex set of D of minimum size. 
3 Main result 

We show how MAAF can be approximated by combining algorithms for 
MAF and DFVS. In particular, we will prove the following theorem. 

Theorem 1. If there exists a c- approximation for MAF and a d-approx- 
imation for DFVS, then there exists a {c+l)d-approximation for MAAF 
(and thus for MinimumHybridization^). 

Suppose there exists a c-approximation for MAF. Let T\ and T 2 be 
two trees and let M be an agreement forest returned by the algorithm. 
Then, 

|M| - 1 < c- MAF(Ti,T 2 ) < c-MAAF(Ti,T 2 ). (1) 

An M- splitting is an acyclic agreement forest that can be obtained 
from M by removing edges and cleaning up. 

Lemma 1. Let T\ and T2 be two trees and M an agreement forest ofT\ 
andT2- Then, there exists an M -splitting of size at most MAAF{T\,T2) + 
\M\. 

Proof. Consider a maximum acyclic agreement forest F of T\ and T 2 . For 
i G {1, 2}, F can be obtained from Tj by removing a set of edges, say E l F , 
and cleaning up. Moreover, also M can be obtained from Tj by removing 
a set of edges, say E l M , and cleaning up. 

Now consider the forest S obtained from T\ by removing Ej^ U Ej? 
and cleaning up. Then, 



• S is an agreement forest of T\ and T2 because it can be obtained 
from T2 by removing edges E 2 M U E 2 F and cleaning up; 

• S is acyclic because it can be obtained by removing edges from F, 
which is acyclic, and cleaning up; 

• S can be obtained from M by removing edges and cleaning up. 

Hence, S is an M-splitting. Furthermore, |»S| < + \E^\ + 1- The 
lemma follows since \E F \ = MAAF(Ti,T 2 ) and \M\=E\ S + 1. □ 

Let OptSplittingT li T 2 (M) denote the size of a minimum-size M-splitting. 
Combining Lemma [1] and equation ([I]), we obtain 

OptSplitting Tl T2 (M) - 1 < (c + l)MAAF(7i, T 2 ) (2) 

We will now show how to find an approximation for the problem of 
finding an optimal M-splitting. We do so by reducing the problem to 
DFVS. We construct an input graph D for DFVS as follows. For every 
vertex of M that has outdegree 2 (in M), we create a vertex in D. There 
is an edge in D from a vertex u to a vertex v precisely if in either T\ or Ti 
(or in both) there is a directed path from u to v. We claim the following. 

Lemma 2. A subset V of the vertices of D is a feedback vertex set of D 
if and only if removing V from M makes it an acyclic agreement forest. 

Proof. We show that D\V has a directed cycle if and only if the inher- 
itance graph of M \ V has a directed cycle. 

To prove that, first suppose that there is a cycle v 1, V2, ■ ■ ■ , = v\ in 
the inheritance graph of M \ V . The vertices in the inheritance graph of 
M\V' correspond to the roots of the components of M\V . Since these 
roots have outdegree 2 in M \ V, they had outdegree 2 in M, and are 
thus vertices of D. So the vertices v\, v%, . . . , that form the cycle are 
vertices of D. Since these vertices are in the inheritance graph of M \ V , 
they can not be in V' and so they are vertices of D\V . The reachability 
relation between these vertices in D \ V is the same as in the inheritance 
graph of M \ V'. So, the vertices v\,V2, ■ ■ ■ , Vk form a cycle in D \ V' . 

Now suppose that there is a cycle w\, W2, ■ ■ ■ , = w\ in D \ V' . Each 
of the vertices W\ , W2 , ■ ■ ■ , Wk 1S a vertex with outdegree-2 in M. Some 
of them might be roots of components, while others are not. However, 
observe that if there is a directed path from a vertex u to a vertex v 
in Ti (or in T2) then there is also a directed path from the root of the 
component of M \ V' that contains u to the root of the component of 
M\V' that contains v. Hence, there is a directed cycle in the inheritance 
graph of M \ V , formed by the roots of the components of M \ V that 
contain w\, W2 , ■ ■ ■ , Wk ■ □ 



Suppose that there exists a d-approximation for DFVS. Let FVS be 
a feedback vertex set returned by this algorithm and let MFVS be a min- 
imum feedback vertex set. Then, removing the vertices of MFVS from M 
gives an optimal M-splitting. Furthermore, OptSplittingr 1 ,T 2 (-^) = 1-^1+ 
| MFVS | . This is because for every vertex in a cycle C, its parent in M 
must participate in some cycle that contains elements of C. So if we start 
by removing the root of the component we are splitting and subsequently 
remove only those vertices whose parents have already been removed we 
see that we add at most one component per vertex. In fact, because ver- 
tices of D all have out-degree 2 in M, we add exactly one component per 
vertex. 

By removing the vertices of FVS from M, we obtain an acyclic agree- 
ment forest T such that 

\T\-\ = \M\ + |FVS| - 1 

< \M\ + d- |MFVS| - 1 

< d{\M\ + |MFVS| - 1) 

= d(OptSplitting Tl T2 (M) - 1) 

< d(c + l)MAAF(Ti, T 2 ), 

where the last inequality follows from equation j2]). Thus, T is a (c+ 1) el- 
approximation to MAAF, which concludes the proof of Theorem [TJ 

4 Practical experiments 

To assess the performance of CycleKiller, a simulation study was un- 
dertaken. We generated 3 synthetic datasets, an easy, a medium and a 
hard one, containing respectively 800, 640 and 640 pairs of rooted binary 
phylogenetic trees. 

The easy data set was created by varying two parameters, namely the 
number of taxa n and the number of rSPR-moves k used to obtain the 
second tree from the first (note that this number is an upper bound on the 
actual rSPR distance). The 800 pairs of rooted binary phylogenetic trees 
were created by varying n in {20, 50, 100, 200} and k in {5, 10, 25}, and 
then creating 40 different instances per each combination of parameters. 
Each pair (T\,T2) of rooted binary phylogenetic trees for a given set 
of parameters n and k is created as follows: The first tree T\ on X = 
{x%, . . . , x n } is generated by first creating a set of n leaf vertices bijectively 
labeled by the set X. Then, two vertices u and v, both with indegree 0, 
are randomly picked and a new vertex w, along with two new edges (w, u) 



and (w, v), is created. This is done until only one vertex with no ancestor, 
the root, is present. The second tree T2 is obtained from T\ by applying k 
rSPR-moves. The medium and the hard data sets were generated in the 
same way as the easy one, but for different choices of the parameters: n 
in {50, 100, 200, 300} and k in {15, 25, 40, 55} for the medium one and n 
in {100, 200, 400, 500} and k in {40, 60, 80, 100} for the hard one. 

The exact hybridization number has been computed by HybridNet 

[5], available from |http : //www . cs . cityu. edu. hk/~rwa ng/sof tware/Hn/treeComp .html 
or with DENDROSCOPE [12], available from |http : //www . dendroscope . org| We 

will refer to these algorithms as the exact algorithms. Each instance has 
been run on a single core of an Intel Xeon E5506 processor. 

Each run that took more than one hour was aborted. For each in- 
stance, we ran our program with the option 2-approx, and, in case the 
latter did not finish within one hour, we ran it again, this time using 
the option 4-approx, always with a one-hour limit. We used the pro- 
gram rSPR vl.03 |17)19| to solve or approximate MAF and GLPK v4.47 
(http://www.gnu.org/software/glpk/) to solve a simple ILP formula- 
tion of DFVS. 

For all instances of the easy data set, CycleKiller finished with 
the 2-approx option within the one hour limit, while for 33 instances 
the exact algorithms were unable to compute the hybridization number. 
Note that, even for "easy" instances, computing the exact hybridization 
number can take a very long time. To give the reader an idea, for 9 
runs of the easy data, Dendroscope and HybridNet did not complete 
within 10 days. Table 1 shows a summary of the results. It can be seen 
that CycleKiller was much faster than the exact algorithms. Moreover, 
for 96.6% of the instances for which an exact algorithm could find a solu- 
tion, CycleKiller also found an optimal solution. While the theoretical 
worst-case approximation ratio of the 2-approx option of CycleKiller 
is 2, in our experiments it performed very close to a 1-approximation. 

For the medium data set, CycleKiller finished with the 2-approx 
option for 613 instances, and for the remaining ones with the 4-approx 
option. The exact algorithms could compute the hybridization number 
for only 199 instances (out of 640). For 97.5% of these instances, Cy- 
cleKiller also found an optimal solution, but with a much better run- 
ning time. Regarding the hard data set, 444 runs were completed with 
the 2-approx option and for the remaining ones we were able to use the 
4-approx option within the given time constraint. Unfortunately, the ex- 
act algorithms were unable to compute the hybridization number for any 
tree-pair of this data set and hence we could not compute the average 
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Table 1. Experimental results. The third column indicates for how many instances at 
least one exact algorithm finished within one hour. The fifth column indicates for how 
many instances the 2-approx option of CycleKiller finished within one hour. For the 
remaining instances, the 4-approx option finished within one hour, as can be seen from 
the seventh column. The average running time for the 2-approx and the 4-approx are 
reported respectively in the sixth and eighth column. The average approximation ratio 
(ninth column) is taken over all instances for which at least one exact method finished. 
The last column indicates the percentage of those instances for which CycleKiller 
found an optimal solution. 

approximation ratios. Over all our experiments, the maximum hybridiza- 
tion number that the exact algorithms could handle was 250 In contrast, 
the 2-approx option of CycleKiller could be used for instances for 
which the size of a MAF was up to 97, and thus for instances for which 
the hybridization number was at least 97. 

To find the limits of the 4-approx option of CycleKiller, we also 
tested it on randomly generated trees. On a normal laptop, it could con- 
struct networks with up to 10,000 leaves and up to 10,000 reticulations 
within 10 minutes. Since the number of reticulations found is at most four 
times the optimal hybridization number, this implies that the 4-approx 
option of CycleKiller can handle hybridization numbers up to at least 
2,500. These randomly generated trees are, however, biologically mean- 
ingless and, therefore, we conducted the extensive experiment described 
above on trees generated by rSPR moves. 
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